Project: Investigate a Dataset - [The Movie Data Base]

Table of Contents

Introduction

Dataset Description

The movie database contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.

Questions for Analysis

  • Question 1: How did the production rate change across years?
  • Question 2: What is the production rate of the top 5 genres each year, and which genre is the most dominant?
  • Question 3: What is the net profit for top 5 genres each year?
  • Question 4: Who are the top 10 most successful directors when it comes to net profit?
  • Question 5: Is there a correlation between average rating and net profit?
  • Question 6: Who are the top 10 most diverse directors when it comes to genres?

Data Wrangling and Cleaning

  1. Reading CSV
  2. Dropping Unnecessary Columns
  3. Dropping Rows With Duplicates and Null Values
  4. Inspect and Fix Data Types
  5. Fixing Columns That Have Separators Between the Names

1. Reading CSV file

2. Drop Unnecessary columns

I will drop a list of columns I am not going to use in my analysis based on the questions I am going to ask.
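A minimal sketch of this step. The column names listed in unused_columns are only an assumption about which columns go unused; the actual list depends on the questions above.

```python
import pandas as pd

# Tiny stand-in for the TMDb table; the values are made up for illustration.
df = pd.DataFrame({
    "original_title": ["A", "B"],
    "homepage": ["http://a.example", None],
    "tagline": ["t1", "t2"],
    "overview": ["o1", "o2"],
    "genres": ["Drama", "Action|Comedy"],
})

# Hypothetical list of columns that the analysis questions never touch.
unused_columns = ["homepage", "tagline", "overview"]
df = df.drop(columns=unused_columns)
```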

3. Dropping Rows With Duplicates and Null Values

Adding new column for the net profit
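A minimal sketch of these two steps on made-up data. It assumes net profit is defined as revenue_adj minus budget_adj; the column name net_profit follows the wording used later in this notebook.

```python
import pandas as pd

# Tiny stand-in for the TMDb table; the values are made up for illustration.
df = pd.DataFrame({
    "original_title": ["A", "A", "B", "C"],
    "director": ["X", "X", None, "Z"],
    "budget_adj": [10.0, 10.0, 5.0, 20.0],
    "revenue_adj": [25.0, 25.0, 7.0, 15.0],
})

# Remove exact duplicate rows, then rows with any missing value.
df = df.drop_duplicates().dropna().reset_index(drop=True)

# Assumed definition: inflation-adjusted revenue minus inflation-adjusted budget.
df["net_profit"] = df["revenue_adj"] - df["budget_adj"]
```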

4. Inspect and Fix Data Types

Data types don't need any fixing.

5. Fixing the columns that have a " | " separator between names

Defining a function

Defining a function that takes the name of the data frame column we want to split into multiple columns and returns the maximum number of columns we are splitting on.
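One way such a helper could look. The name max_split_count, its signature, and the "|" default separator (without surrounding spaces) are assumptions for illustration, not necessarily what the notebook's code cell uses.

```python
import pandas as pd

def max_split_count(df, column_name, sep="|"):
    """Return the largest number of pieces any value in the column splits into."""
    # Missing values are skipped so they do not affect the maximum.
    pieces_per_row = df[column_name].dropna().str.split(sep).str.len()
    return int(pieces_per_row.max())

# Made-up example data.
sample = pd.DataFrame({"genres": ["Action|Drama|Comedy", "Drama", None]})
widest = max_split_count(sample, "genres")
```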

Defining a function

Defining a function that takes the column name as a string and the maximum number of columns we want to split on, and returns a list of names for the new columns.
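A possible sketch of this helper. The function name and the name_1, name_2, ... naming scheme are assumptions; any scheme that yields unique names would work.

```python
def new_column_names(column_name, max_columns):
    """Build names such as genres_1, genres_2, ... for the split columns."""
    return [f"{column_name}_{i}" for i in range(1, max_columns + 1)]

genre_columns = new_column_names("genres", 3)
```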

Splitting the columns cast, genres and production_companies

I will split each of the columns into multiple columns with new names using the .str.split method.

Dropping the columns cast, genres and production_companies
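The split-then-drop pattern can be sketched for a single column as follows. The data is made up, and the "|" separator and the genres_1, genres_2, ... names are assumptions.

```python
import pandas as pd

# Made-up example rows.
df = pd.DataFrame({
    "original_title": ["A", "B"],
    "genres": ["Action|Drama|Comedy", "Drama"],
})

# expand=True returns one column per piece; shorter rows are padded with missing values.
parts = df["genres"].str.split("|", expand=True)
parts.columns = [f"genres_{i}" for i in range(1, parts.shape[1] + 1)]

# Replace the original combined column with the new split columns.
df = pd.concat([df.drop(columns="genres"), parts], axis=1)
```

The same steps would be repeated for cast and production_companies.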

The dataset is now clean and ready to be explored.

Exploratory Data Analysis

Research Question 1: How did the production rate change across years?

Research Question 2: What is the production rate of the top 5 genres each year, and which genre is the most dominant?

Research Question 3: What is the net profit for top 5 genres each year?

Research Question 4: Who are the top 10 most successful directors when it comes to net profit?

Research Question 5: Is there a correlation between average rating and net profit?

Research Question 6: Who are the top 10 most diverse directors when it comes to genres?

Research Question 1: How did the production rate change across years?

Here we are plotting the frequency of movie production over the years.

The plot supports hovering, so you can hover over a year to see the total number of movies produced in that year.
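The numbers behind such a plot can be computed with a simple count per year. This sketch assumes the year is stored in a column named release_year and uses made-up rows; the plotting call itself is left out.

```python
import pandas as pd

# Made-up example rows.
df = pd.DataFrame({"release_year": [2001, 2000, 2001, 2002, 2001]})

# Count movies per year and order the result chronologically.
movies_per_year = df["release_year"].value_counts().sort_index()
```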

Conclusion.

Research Question 2: What is the production rate of the top 5 genres each year, and which genre is the most dominant?

Defining a function

Defining a function that takes a dataframe, a column name and a column count, and returns a frequency count of all the columns combined.

Here I am getting the top 5 genres overall
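A sketch of how the combined count and the top genres could be obtained. The function name combined_counts and the genres_1, genres_2, ... column names are assumptions, and the data is made up.

```python
import pandas as pd

def combined_counts(df, column_name, column_count):
    """Count how often each value appears across column_name_1 .. column_name_N."""
    columns = [f"{column_name}_{i}" for i in range(1, column_count + 1)]
    # Stack the split columns into one long Series; value_counts ignores missing values.
    return pd.concat([df[c] for c in columns]).value_counts()

# Made-up example rows.
df = pd.DataFrame({
    "genres_1": ["Drama", "Action", "Drama"],
    "genres_2": ["Comedy", "Drama", None],
})

genre_counts = combined_counts(df, "genres", 2)
top_genres = genre_counts.head(5).index.tolist()
```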

Here we are plotting the total number of movies and the proportional number of movies produced each year for the top 5 genres.

The plot supports hovering, so you can hover over a year and a genre to see either the actual number of movies produced that year for that genre or the genre's percentage for that year.

Conclusion.

  • We can see that drama is the most dominant genre.
  • We can also conclude that the rate of production for each genre stayed almost the same over the years.

Research Question 3: What is the net profit for top 5 genres each year?

Minor issue to be fixed

There is a minor issue where some of the movies have 0 for budget_adj or revenue_adj, so I cleaned the data frame here for the sake of the net_profit analysis.
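A minimal sketch of this cleanup on made-up rows. It assumes a row is kept only when both budget_adj and revenue_adj are greater than zero; profit_df is a hypothetical name for the filtered copy.

```python
import pandas as pd

# Made-up example rows.
df = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget_adj": [10.0, 0.0, 5.0, 0.0],
    "revenue_adj": [25.0, 7.0, 0.0, 0.0],
})

# Keep only movies where both figures are reported (non-zero).
has_both = (df["budget_adj"] > 0) & (df["revenue_adj"] > 0)
profit_df = df[has_both].copy()
profit_df["net_profit"] = profit_df["revenue_adj"] - profit_df["budget_adj"]
```

Working on a copy leaves the full data frame available for the questions that do not involve profit.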

Here we are plotting the net profit of each of the top 5 genres and the proportional net profit of each of the top 5 genres for each year.

The plot supports hovering, so you can hover over a year and a genre to see either the net profit for that year or the genre's percentage of net profit for that year.

Conclusion.

Research Question 4: Who are the top 10 most successful directors when it comes to net profit?

Here we are plotting the total net profit for top 10 directors.

The plot supports hovering, so you can see the total net profit and the director.
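The ranking behind this plot can be sketched as a group-and-sum. The data is made up, and the sketch treats each director value as a single name, which is an assumption; the real column may hold several names per movie.

```python
import pandas as pd

# Made-up example rows.
df = pd.DataFrame({
    "director": ["X", "Y", "X", "Z"],
    "net_profit": [10.0, 50.0, 30.0, -5.0],
})

# Total net profit per director, largest first; use nlargest(10) on the full data.
top_directors = df.groupby("director")["net_profit"].sum().nlargest(2)
```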

Conclusion.

Research Question 5: Is there a correlation between average rating and net profit?

Here we are plotting the relation between average rating and net profit.

The plot supports hovering, so you can see the net profit and average rating.
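Alongside the plot, a correlation coefficient gives a single number for the strength of the relation. This sketch assumes the rating is stored in a column named vote_average. The data is made up and perfectly linear, so the value it produces says nothing about the real dataset.

```python
import pandas as pd

# Made-up, perfectly linear example rows.
df = pd.DataFrame({
    "vote_average": [5.0, 6.0, 7.0, 8.0],
    "net_profit": [10.0, 20.0, 30.0, 40.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = df["vote_average"].corr(df["net_profit"])
```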

Conclusion.

Research Question 6: Who are the top 10 most diverse directors when it comes to genres?

Here we are plotting the top 10 directors when it comes to genre diversity.

The plot supports hovering, so you can see the genre, genre count, director and total count.
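One possible way to measure diversity is the number of distinct genres each director has worked in. This definition, the genres_1 and genres_2 column names, and the data are all assumptions for illustration.

```python
import pandas as pd

# Made-up example rows with already-split genre columns.
df = pd.DataFrame({
    "director": ["X", "X", "Y", "Y"],
    "genres_1": ["Drama", "Action", "Drama", "Drama"],
    "genres_2": ["Comedy", "Drama", None, "Thriller"],
})

# Reshape to one (director, genre) pair per row and discard empty genre slots.
long_df = df.melt(
    id_vars="director",
    value_vars=["genres_1", "genres_2"],
    value_name="genre",
).dropna(subset=["genre"])

# Count distinct genres per director and keep the 10 most diverse.
diversity = long_df.groupby("director")["genre"].nunique().nlargest(10)
```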

Conclusion.

Limitations

The main limitation I found is that a large number of movies have zero budget and revenue, so I had to remove these movies whenever I analyzed anything related to profit. This shrinks the data and lowers the confidence in the answers to the revenue questions.